Automatic Dictionary Construction and Identification of Parallel Text Pairs
Abstract
When creating dictionaries for use in, for example, cross-language search engines, parallel or comparable text pairs are needed. Multilingual web sites may contain parallel texts, but these can be difficult to detect. For instance, the multilingual website Hallå Norden contains information in five languages: Swedish, Danish, Norwegian, Icelandic and Finnish. Working with these texts, we discovered two main problems: the parallel corpus was very sparse, containing on average fewer than 80,000 words per language pair (in the final version of the corpora), and it was difficult to automatically detect parallel text pairs. We found that, on average, around 55 percent of the texts were not parallel. Creating dictionaries with the word aligner Uplug gave on average 213 dictionary entries. Despite the sparseness of the corpus, the results were surprisingly good compared to other experiments with larger corpora. Following this work, we carried out two sets of experiments on automatic identification of parallel text pairs. The first experiment used the frequency distribution of word-initial letters to map a text in one language to a corresponding text in another in the JRC-Acquis corpus (European Council legal texts). Using English and Swedish as the language pair, and running a ten-fold random pairing, the algorithm made 87 percent correct matches (baseline-random 50 percent). Attempting to pick the correct text from among nine randomly chosen false matches and one true match yielded a success rate of 68 percent (baseline-random 10
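The word-initial-letter idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the two distributions are compared or whether letters are mapped between languages, so everything below is an assumption made for illustration: the function names `initial_letter_profile`, `cosine_similarity` and `best_match` are hypothetical, profiles are compared directly letter by letter, and cosine similarity stands in for whatever measure the authors actually used.

```python
from collections import Counter
import math


def initial_letter_profile(text):
    """Relative frequency of the first letter of each whitespace-separated word.

    Words that do not start with a letter (numbers, punctuation) are skipped.
    """
    initials = [w[0].lower() for w in text.split() if w[0].isalpha()]
    total = len(initials)
    if total == 0:
        return {}
    return {ch: n / total for ch, n in Counter(initials).items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles (dicts)."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


def best_match(source_text, candidates):
    """Index of the candidate whose profile is closest to the source profile.

    Assumption: profiles are compared directly, which presupposes that the
    two languages share enough cognates and names for initials to line up.
    """
    source_profile = initial_letter_profile(source_text)
    scores = [
        cosine_similarity(source_profile, initial_letter_profile(candidate))
        for candidate in candidates
    ]
    return max(range(len(candidates)), key=scores.__getitem__)
```

Calling `best_match(english_text, swedish_candidates)` with one true translation and nine distractors would mirror the one-in-ten setup described in the abstract. This sketch makes no claim to reproduce the reported success rates.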
Similar resources
Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content
Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper presents a technique for fully automatic construction of constantly growing parallel corpora. We propose a simple and effective dictionary-based algorithm to extract parallel document pairs from a large collection of articles retrieved from the Internet, potentially containin...
Identification of Parallel Text Pairs Using Fingerprints
When creating dictionaries for use in, for example, cross-language search engines, one often uses a word alignment system that takes parallel or comparable text pairs as input and produces a word list. Multilingual web sites may contain parallel texts, but these can be difficult to detect. In this article we describe an experiment on automatic identification of parallel text pairs. We utilize the f...
Automatically Extracting Parallel Sentences from Wikipedia Using Sequential Matching of Language Resources
In this paper, we propose a method to find similar sentences based on language resources for building a parallel corpus between English and Korean from Wikipedia. We use a Wiki-dictionary consisting of document titles from Wikipedia and bilingual example sentence pairs from a Web dictionary instead of a traditional machine-readable dictionary. In this way, we perform similarity calculation betwe...
Automatic Construction of Persian ICT WordNet using Princeton WordNet
WordNet is a large lexical database of English language, in which, nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...
Bilingual Dictionary Construction with Transliteration Filtering
In this paper we present a bilingual transliteration lexicon of 170K Japanese-English technical terms in the scientific domain. Translation pairs are extracted by filtering a large list of transliteration candidates generated automatically from a phrase table trained on parallel corpora. Filtering uses a novel transliteration similarity measure based on a discriminative phrase-based machine tra...
Publication date: 2008